Method and apparatus for performing a carry-save division operation

ABSTRACT

One embodiment of the present invention provides a system that performs a carry-save division operation that divides a numerator, N, by a denominator, D, to produce an approximation of the quotient, Q=N/D. The system approximates Q by iteratively selecting an operation to perform based on higher-order bits of a remainder, r, and then performing the operation, wherein the operation can include subtracting D from r and adding a coefficient c to a quotient calculated thus far, q, or adding D to r and subtracting c from q. These subtraction and addition operations maintain r and q in carry-save form, which eliminates the need for carry propagation and thereby speeds up the division operation. Furthermore, the selection logic is simpler than in previous SRT division implementations, which provides another important speedup.

BACKGROUND

[0001] 1. Field of the Invention

[0002] The present invention relates to techniques for performing mathematical operations within computer systems. More specifically, the present invention relates to a method and an apparatus for efficiently performing a carry-save division operation in circuitry within a computer system.

[0003] 2. Related Art

[0004] In order to keep pace with continually increasing microprocessor clock speeds, computational circuitry within the microprocessor core must perform computational operations at increasingly faster rates. One of the most time-consuming computational operations that can be performed within a computer system is a division operation. A division operation involves dividing a numerator, N, by a denominator, D, to produce a resulting approximation of quotient, Q, wherein Q=N/D.

[0005] Computer systems often perform division operations using a variant of the SRT technique, which iteratively performs subtraction operations on a remainder, R, to retire a fixed number of quotient bits in each iteration. (The SRT technique is named for Sweeney, Robertson and Tocher, who each independently developed the technique at about the same time.)

[0006] Unfortunately, each iteration of the SRT division technique involves performing addition and/or subtraction operations that require time-consuming carry completions and selection logic to decide which operations to perform. Hence, hardware implementations of the SRT division technique tend to be relatively slow.

[0007] What is needed is a method and an apparatus for performing a division operation that takes less time than the SRT technique.

SUMMARY

[0008] One embodiment of the present invention provides a system that performs a carry-save division operation that divides a numerator, N, by a denominator, D, to produce an approximation of the quotient, Q=N/D. The system approximates Q by iteratively selecting an operation to perform based on higher-order bits of a remainder, r, and then performing the operation, wherein the operation can include subtracting D from r and adding a coefficient c to a quotient calculated thus far, q, or adding D to r and subtracting c from q. These subtraction and addition operations maintain r and q in carry-save form, which eliminates the need for carry propagation and thereby speeds up the division operation. Furthermore, the selection logic is simpler than in previous SRT division implementations, which provides another important speedup.

[0009] In a variation on this embodiment, maintaining r in carry-save form involves maintaining a sum component, r_(s), and a carry component, r_(c).

[0010] In a further variation, maintaining q in carry-save form involves maintaining a sum component, q_(s), and a carry component, q_(c).

[0011] In a further variation, the system initializes r, q and c by setting r_(s)=R and r_(c)=0; setting q_(s)=0 and q_(c)=0; and setting c=1.

[0012] In a further variation, after the iterations are complete, the system performs a carry completion addition that adds q_(s) and q_(c) to generate q in non-redundant form.

[0013] In a variation on this embodiment, the operation can include multiplying both r_(s) and r_(c) by 2 and dividing c by 2.

[0014] In a variation on this embodiment, the operation can include multiplying both r_(s) and r_(c) by 2, dividing c by 2, and inverting the most significant bits of r_(s) and r_(c).

[0015] In a variation on this embodiment, the operation can include multiplying both r_(s) and r_(c) by 4, dividing c by 4, and then inverting the most significant bits of r_(s) and r_(c).

[0016] In a variation on this embodiment, the operation can include subtracting D from r_(s) and r_(c), adding c to q_(s) and q_(c), multiplying both r_(s) and r_(c) by 2, dividing c by 2, and then inverting the most significant bits of r_(s) and r_(c).

[0017] In a variation on this embodiment, the operation can include subtracting 2D from r_(s) and r_(c), adding 2c to q_(s) and q_(c), multiplying both r_(s) and r_(c) by 2, dividing c by 2, and then inverting the most significant bits of r_(s) and r_(c).

[0018] In a variation on this embodiment, the operation can include adding D to r_(s) and r_(c), subtracting c from q_(s) and q_(c), multiplying both r_(s) and r_(c) by 2, dividing c by 2, and then inverting the most significant bits of r_(s) and r_(c).

[0019] In a variation on this embodiment, the operation can include adding 2D to r_(s) and r_(c), subtracting 2c from q_(s) and q_(c), multiplying both r_(s) and r_(c) by 2, dividing c by 2, and then inverting the most significant bits of r_(s) and r_(c).

BRIEF DESCRIPTION OF THE FIGURES

[0020] FIG. 1 illustrates a set of regions defined by higher-order bits of sum and carry words for the remainder in accordance with an embodiment of the present invention.

[0021] FIG. 2 illustrates the effect of carry-save addition and subtraction operations in accordance with an embodiment of the present invention.

[0022] FIG. 3A illustrates a set of regions defined by higher-order bits of sum and carry words for a remainder in accordance with another embodiment of the present invention.

[0023] FIG. 3B illustrates a set of regions defined by higher-order bits of sum and carry words for a remainder in accordance with yet another embodiment of the present invention.

[0024] FIG. 4A illustrates the effect of carry-save addition and subtraction operations in accordance with an embodiment of the present invention.

[0025] FIG. 4B illustrates the effect of carry-save addition and subtraction operations in accordance with an embodiment of the present invention.

[0026] FIG. 5 illustrates a set of regions defined by higher-order bits of sum and carry words for a remainder in accordance with an embodiment of the present invention.

[0027] FIG. 6 illustrates a set of regions defined by higher-order bits of sum and carry words for a remainder in accordance with another embodiment of the present invention.

[0028] FIG. 7 illustrates a possible hardware implementation of a carry-save division circuit in accordance with an embodiment of the present invention.

[0029] FIG. 8 illustrates another possible hardware implementation of a carry-save division circuit in accordance with an embodiment of the present invention.

[0030] FIG. 9 illustrates yet another possible hardware implementation of a carry-save division circuit in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION

[0031] The following description is presented to enable any person skilled in the art to make and use the invention, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present invention. Thus, the present invention is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.

[0032] The division operation computes an approximation for Q=C*R/D, where Q is the quotient, D is the denominator (divisor) and C*R is the numerator (dividend). Normally for a division we have C=1. Here, however, the task is to compute the result of a multiplication and a division at the same time. Notice that, when we choose D=1, the technique computes the multiplication C*R.

[0033] Because we are interested in a hardware implementation, we make some assumptions about the ranges of C, R, and D. We assume that,

C∈[0,2^(K))  (1)

R∈[−2^(K+1), 2^(K+1))  (2)

D∈[2^(K), 2^(K+1))  (3)

[0034] For binary representations of C, R, and D, these assumptions can be satisfied by performing the appropriate shift operations before the start of the division operation. Notice that for these assumptions Q will be in the same range as R, that is, Q=C*R/D∈[−2^(K+1), 2^(K+1)). Finally, we require that the error in our approximation of the quotient Q is less than 2^(−L).

[0035] Technique A

[0036] The formula Q*D=C*R expresses the desired relation between Q, D, C, and R. In our first technique, called Technique A, we use variables q, r, and c. The invariant for these variables is,

q*D+c*r=C*R  (4)

[0037] wherein the variable q represents the quotient calculated “thus far,” and r represents the remainder “thus far.”

[0038] Technique A appears below. (Note that conditions B0 through B3 are defined later.)

    q:=0; c:=C; r:=R; n:=0;
    while B0 do {
      if B1 then {r:=r*2; c:=c/2; n:=n+1}
      elseif B2 then {r:=r−D; q:=q+c}
      elseif B3 then {r:=r+D; q:=q−c}
    }

[0039] When we represent r and c by binary numbers, we can easily implement the statements r:=r*2; c:=c/2 by shift operations on r and c. We use the variable n to count the number of shifts on c.

[0040] The initialization q:=0; c:=C; r:=R establishes invariant (4) before the start of the iterations in Technique A.

[0041] Furthermore, each of the statements

r:=r*2; c:=c/2

r:=r−D; q:=q+c

r:=r+D; q:=q−c

[0042] maintains invariant (4), irrespective of the conditions B0 through B3. For example, if (4) holds before statement r:=r+D; q:=q−c, then after execution of this statement we have

(q−c)*D+c*(r+D)=q*D+c*r

[0043] Thus, invariant (4) also holds after the statement.

[0044] Note that B0 through B3 can be selected in a number of different ways. The following choices yield Technique A.

B0=n≦K+L+1

B1=(−D<r<D)

B2=(r≧D)

B3=(r≦−D)

[0045] The choice for the termination condition B0 can be explained as follows. Because of the initial conditions on R and D and the conditions B1 through B3, Technique A has as additional invariant

|r|<2*D  (5)

[0046] Notice that none of the statements in the repetition violates invariant (5).
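The link between B0 and the required accuracy can be made explicit with a short derivation. This is a sketch that uses only invariants (4) and (5) and the ranges (1) and (3); the symbol n_end, the value of n when the loop exits, is introduced here for illustration and does not appear in the original text.

```latex
\begin{aligned}
Q - q &= \frac{C R - q D}{D} = \frac{c\,r}{D}
  && \text{by invariant (4)} \\
|Q - q| &\le 2c = \frac{2C}{2^{\,n_{\mathrm{end}}}}
  && \text{by invariant (5) and } c = C/2^{n} \\
&< 2^{\,K+1-n_{\mathrm{end}}} = 2^{-(L+1)} < 2^{-L}
  && \text{since } C < 2^{K} \text{ and } n_{\mathrm{end}} = K+L+2
\end{aligned}
```

Because n increases by one per shift and the loop runs while n≦K+L+1, it exits with n=K+L+2, which is what the last line assumes.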

[0047] Technique A is guaranteed to terminate, because each repetition step without a shift operation is followed by a repetition step with a shift operation. In other words, a repetition step with a subtraction or addition creates a state where condition B1 applies. Consequently, n increases at least every two repetition steps, and thus Technique A will terminate.

[0048] Assuming a random distribution of C and R, the average number of additions and subtractions per shift is 0.5. Phrased differently, for each addition or subtraction, there will be two shifts on average.
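Technique A can be exercised in software to watch invariant (4) hold at every step. The following is a minimal sketch, not a hardware description: it assumes exact rational arithmetic via Python's `fractions.Fraction`, and the function name `technique_a` and the sample operands are illustrative choices.

```python
from fractions import Fraction

def technique_a(C, R, D, K, L):
    """Sketch of Technique A; returns (q, r, c, n) after the loop ends."""
    q, c, r, n = Fraction(0), Fraction(C), Fraction(R), 0
    while n <= K + L + 1:                  # B0: termination condition
        if -D < r < D:                     # B1: shift
            r, c, n = r * 2, c / 2, n + 1
        elif r >= D:                       # B2: subtract D
            r, q = r - D, q + c
        else:                              # B3: r <= -D, add D
            r, q = r + D, q - c
        assert q * D + c * r == C * R      # invariant (4) after every step
    return q, r, c, n

# Illustrative operands with K=2: C in [0,4), R in [-8,8), D in [4,8).
q, r, c, n = technique_a(C=1, R=7, D=5, K=2, L=8)
print(float(q), n)
```

With these operands the returned q approximates 7/5 to within the 2^(−L) bound required in paragraph [0034].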

[0049] Technique B

[0050] Technique B arises when we choose more efficient conditions for B1 through B3. Testing whether r<D requires a comparison, which in general incurs many gate delays. However, testing whether r∈[−2^(K), 2^(K)) for some K can be much faster for a two's complement representation of r, in particular if K is the position of the most significant or second-most-significant bit.

[0051] Technique B maintains as an invariant not only property (4), but also the property

r∈[−2^(K+1), 2^(K+1))  (6)

[0052] The choices for B0 through B3 are as follows. Recall that B0 is the termination condition, B1 is the condition for doubling r, B2 is the condition for subtracting D, and B3 is the condition for adding D.

B0=n≦K+L+1

B1=r∈[−2^(K), 2^(K))

B2=r∈[2^(K), 2^(K+1))

B3=r∈[−2^(K+1), −2^(K))

[0053] Recall that property (4) remains an invariant of Technique B, because the choices for B0 through B3 have no effect on the validity of the invariant. Secondly, notice that, with these choices for B1 through B3, none of the statements in Technique B violates invariant (6).

[0054] Our termination condition B0 may remain the same, because invariant (6) and the initial condition 2^(K)≦D guarantee that |r|≦2*D is also an invariant of Technique B. Accordingly, the reasoning about the termination conditions for Technique A also applies to Technique B.

[0055] Although the choice for termination condition B0 has not changed, the choices for B1 through B3 have changed and have an effect on the efficiency of the technique. Tests B1 through B3 for Technique B are much faster than the tests for Technique A. Moreover, Technique B may execute fewer additions or subtractions on average per shift operation. When D=2^(K), the average number of additions and subtractions per shift is ½, as for Technique A. When D approaches 2^(K+1), the average number of additions and subtractions per shift turns out to approach ½ as well. However, when D=3*2^(K−1), the average number of additions and subtractions per shift turns out to be ⅓. These values are the extremes for the average number of additions and subtractions per shift for Technique B and a fixed D. Consequently, the average number of additions and subtractions per shift for any D will be somewhere between ½ and ⅓.

[0056] Note that Technique B is a slight generalization of the well-known SRT division technique. This generalization involves considering a general C instead of C=1.
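The range tests of Technique B depend only on the bits of r at position K and above. A minimal sketch follows, assuming integer R and D, an exact `Fraction` for q and c, and Python's arithmetic right shift as a stand-in for reading the top bits of a two's complement word; the function name `technique_b` and the `adds` counter are illustrative additions.

```python
from fractions import Fraction

def technique_b(C, R, D, K, L):
    """Sketch of Technique B; returns (q, adds, n)."""
    q, c, r, n = Fraction(0), Fraction(C), R, 0
    adds = 0                                # additions and subtractions performed
    while n <= K + L + 1:                   # B0
        top = r >> K                        # floor(r / 2^K); -2, -1, 0 or 1 under (6)
        if top in (-1, 0):                  # B1: r in [-2^K, 2^K)
            r, c, n = r * 2, c / 2, n + 1
        elif top == 1:                      # B2: r in [2^K, 2^(K+1))
            r, q, adds = r - D, q + c, adds + 1
        else:                               # B3: r in [-2^(K+1), -2^K)
            r, q, adds = r + D, q - c, adds + 1
        assert -2 ** (K + 1) <= r < 2 ** (K + 1)   # invariant (6)
        assert q * D + c * r == C * R              # invariant (4)
    return q, adds, n

# Illustrative operands with K=3: C in [0,8), R in [-16,16), D in [8,16).
q, adds, n = technique_b(C=5, R=-13, D=12, K=3, L=10)
print(float(q), adds, n)
```

The ratio `adds / n` gives the number of additions and subtractions per shift for one run; averaging it over many operands is one way to probe the ½ and ⅓ figures quoted above.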

[0057] Technique C

[0058] The third technique attempts to reduce the execution time even further by speeding up the additions and subtractions. The addition and subtraction operations are the only operations that may have room for a possible speed up. This is because Technique A already has an efficient termination condition, and Technique B already speeds up the process of selecting between a shift, an addition, or a subtraction as the next operation.

[0059] Technique C achieves a speed-up by keeping the remainder r and the quotient q in carry-save form. That is, instead of a single remainder r and a single quotient q, we have a pair, r₀, r₁, and a pair, q₀, q₁, where r₀+r₁=r and q₀+q₁=q. The pairs r₀, r₁ and q₀, q₁ are produced by full carry-save adders, each of which produces a sum bit and a carry bit, also called the parity and majority bit respectively. One variable, r₀, represents all the sum bits and the other variable, r₁, represents all the carry bits. By storing r in carry-save form, the implementation does not need to resolve the carry bits for each addition, which is a computation that takes an amount of time proportional to the logarithm of the number of bits in the worst case.

[0060] The invariant for the division operation is as follows:

(q₀+q₁)*D+c*(r₀+r₁)=C*R  (7)

[0061] The following ranges apply for r₀ and r₁:

r₀∈[−2^(K+2), 2^(K+2)) and r₁∈[−2^(K+2), 2^(K+2))

[0062] Furthermore, we have as an invariant the following property.

r₀+r₁∈[−2^(K+2), 2^(K+2))  (8)

[0063] FIG. 1 shows all points (r₀, r₁) within the required boundaries. The complete region in FIG. 1 between the lines r₀+r₁=−2^(K+2) and r₀+r₁=2^(K+2) is divided into basically six sub-regions: (1) the region T0; (2) the region T1; (3) the union of the regions X0, X1, X2, and X3; (4) the region ADD; (5) the region SUB; and (6) the rest. Each of the first five regions causes a different operation to be performed on r₀, r₁ and the other variables in Technique C. The rest region turns out not to play a role.

[0064] We assume that each region includes the lower bounds for the r₀ and r₁ coordinates and excludes the upper bounds. This choice turns out to fit well with a two's complement representation of r₀ and r₁.

[0065] Technique C uses a carry-save addition add(x,y,z) that takes three inputs and returns two results add₀(x,y,z) and add₁(x,y,z). The function add satisfies

add₀(x,y,z)+add₁(x,y,z)=x+y+z  (9)

[0066] where add₀ is the parity function, or “sum” function, and add₁ is the majority function, or “carry” function. We denote in Technique C an assignment using this addition function as

r0,r1:=add(x,y,z)

[0067] The meaning of this notation is that r₀ is assigned the value add₀(x,y,z) and r₁ is assigned the value add₁(x,y,z).
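Property (9) can be checked with a few lines of bitwise code. This is a sketch under one assumption: Python's unbounded integers stand in for two's complement words, so no word width or overflow is modeled. The function name `add` follows the notation above.

```python
def add(x, y, z):
    """Carry-save addition: returns (add0, add1) with add0 + add1 == x + y + z."""
    add0 = x ^ y ^ z                              # parity ("sum") bit per position
    add1 = ((x & y) | (x & z) | (y & z)) << 1     # majority ("carry") bit, weight doubled
    return add0, add1

# The assignment r0,r1:=add(x,y,z) from the text, with illustrative operands.
r0, r1 = add(13, -6, 9)
print(r0 + r1)   # the single addition that resolves the carries
```

No carry travels from one bit position to the next inside `add`; each output bit depends on three input bits only, which is why the step has constant depth regardless of word length.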

[0068] Technique C appears below. We have used the labels of FIG. 1 to specify the conditions in the technique. The notation X0++X1 denotes the union of the regions X0 and X1.

    Technique C:
    q0:=0; q1:=0; c:=C; r0:=R; r1:=0; n:=0;
    while (n <= K+L+2) do {
      if ((r0,r1) in X0++X1++X2++X3) then
        {r0,r1:=r0*2,r1*2; c:=c/2; n:=n+1}
      elseif ((r0,r1) in SUB) then
        {r0,r1:=add(r0,r1,−D); q0,q1:=add(q0,q1,c)}
      elseif ((r0,r1) in ADD) then
        {r0,r1:=add(r0,r1,D); q0,q1:=add(q0,q1,−c)}
      elseif ((r0,r1) in T0) then
        {r0,r1:=r0+2^(K+1),r1−2^(K+1)}
      elseif ((r0,r1) in T1) then
        {r0,r1:=r0−2^(K+1),r1+2^(K+1)}
    }

[0069] Note that any point in region T0 is translated over (2^(K+1), −2^(K+1)), whereas any point in region T1 is translated over (−2^(K+1), 2^(K+1)).

[0070] Stay Within Bold Inner Square

[0071] The first optimization to Technique C is the combination of some repetition steps such that the result of each repetition step is again a point in the bold inner square of FIG. 1. The bold inner square is the union of the regions X0, X1, X2, X3, SUB, and ADD. If each repetition step yields points that are within the inner bold square, we can eliminate the tests for the translations from the technique. This does not mean that no translations occur. In fact, any necessary translations are merged into other repetition steps.

[0072] Another benefit of staying in the inner square is that in a two's complement representation of each point in the inner square the two most significant bits are always the same. In other words, we can just as well omit the most significant bit.

[0073] The only operations in Technique C that return points outside the bold inner square are doublings from regions X0 and X1, additions from region ADD, and subtractions from region SUB. Let us look at the doublings from regions X0 and X1 first. Notice that after executing a doubling for the regions X0 and X1, Technique C performs a translation for points in region T0. Instead of translating any point in region T0, we can just as well translate any point in region T0 and X0. In other words, we can translate any point that is a result of a doubling from a point in region X0. Any doubling of region X0 followed by a translation over (2^(K+1), −2^(K+1)) in effect expands region X0 to the bold inner square. Similarly, any doubling of region X1 followed by a translation over (−2^(K+1), 2^(K+1)) in effect expands region X1 to the bold inner square.

[0074] Now let us look at additions and subtractions. Note that carry-save additions and subtractions may return points outside the bold inner square. For example, subtracting D from any point in region S0 in FIG. 2 returns a point in region TS0, which is outside the bold inner square. Subtracting D from any point in region S1 or S1′ in FIG. 2, however, returns a point in region TS1, which is inside the bold inner square. Because region TS0 is inside region T0 of FIG. 1, Technique C translates each point in region TS0 to a point in regions S1 or X2. Because TS1 is inside the bold inner square, Technique C does not translate points from region TS1. Notice, however, that if you translate region TS1 over (2^(K+1), −2^(K+1)), the result still ends up inside the inner square. Notice also that this translation does not invalidate any of our invariants (4) and (5). Consequently, if we choose to follow each subtraction of D from any point in region SUB by a translation over (2^(K+1), −2^(K+1)), the result ends up inside the bold inner square. For reasons of symmetry, similar remarks can be made for the additions, but this time the translations are over (−2^(K+1), 2^(K+1)).

[0075] The following technique, called Technique D, incorporates the optimizations discussed in this section. Each doubling from X0 or X1 is followed by a translation and each addition or subtraction is followed by a translation. Technique D has an invariant that is stronger than invariant (5), viz., (r₀, r₁) is always contained within the bold inner square, where lower bounds are included and upper bounds are excluded, in formula,

r₀∈[−2^(K+1), 2^(K+1)) and r₁∈[−2^(K+1), 2^(K+1)).

[0076] Because of this last invariant, we can eliminate the tests for translations entirely. A description of Technique D appears below. Note that we use the same labels as for the regions of FIG. 1. As before, the notation X2++X3 denotes the union of regions X2 and X3.

    Technique D:
    q0:=0; q1:=0; c:=C; r0:=R; r1:=0; n:=0;
    while (n <= K+L+2) do {
      if ((r0,r1) in X2++X3) then
        {r0,r1:=r0*2,r1*2; c:=c/2; n:=n+1}
      elseif ((r0,r1) in X0) then
        {r0,r1:=r0*2,r1*2; c:=c/2; n:=n+1;
         r0,r1:=r0+2^(K+1),r1−2^(K+1)}
      elseif ((r0,r1) in X1) then
        {r0,r1:=r0*2,r1*2; c:=c/2; n:=n+1;
         r0,r1:=r0−2^(K+1),r1+2^(K+1)}
      elseif ((r0,r1) in SUB) then
        {r0,r1:=add(r0,r1,−D); q0,q1:=add(q0,q1,c);
         r0,r1:=r0+2^(K+1),r1−2^(K+1)}
      elseif ((r0,r1) in ADD) then
        {r0,r1:=add(r0,r1,D); q0,q1:=add(q0,q1,−c);
         r0,r1:=r0−2^(K+1),r1+2^(K+1)}
    }

[0077] Implementing Translations

[0078] If we assume a two's complement representation of K+3 non-fractional bits for r₀ and r₁, translations over (t,−t) and (−t,t), with t=2^(K+1), to points inside the bold inner square are easy to implement. Both translations amount to inverting the second-most significant bit and, because the results are in the inner square, making the most significant bit equal to the second-most significant bit. Notice that in a binary representation where K+2 and K+1 are the positions of the most and second-most significant bits, the translations over 2^(K+1) and −2^(K+1) involve the manipulation of these two most significant bits only.

[0079] For a translation over +2^(K+1) to a point in the bold innersquare, the two most significant bits change as follows, 10→11 and11→00.

[0080] For a translation over −2^(K+1) to a point in the bold innersquare, the two most significant bits change as follows, 01→00 and00→11.

[0081] Notice that the second-most significant bit in each case changes and the most significant bit is a copy of the second-most significant bit.

[0082] Because of these observations, we can re-phrase Technique D as follows, again using the region labels of FIG. 1.

    Technique E:
    q0:=0; q1:=0; c:=C; r0:=R; r1:=0; n:=0;
    while (n <= K+L+2) do {
      if ((r0,r1) in X2++X3) then
        {r0,r1:=r0*2,r1*2; c:=c/2; n:=n+1}
      elseif ((r0,r1) in X0++X1) then
        {r0,r1:=r0*2,r1*2; c:=c/2; n:=n+1;
         invert(K+1, r0, r1)}
      elseif ((r0,r1) in SUB) then
        {r0,r1:=add(r0,r1,−D); q0,q1:=add(q0,q1,c);
         invert(K+1, r0, r1)}
      elseif ((r0,r1) in ADD) then
        {r0,r1:=add(r0,r1,D); q0,q1:=add(q0,q1,−c);
         invert(K+1, r0, r1)}
    }

[0083] Here, invert(K+1, r0, r1) means “invert bit K+1 in r₀ and r₁ and make bit K+2 equal to bit K+1.” Because both translations in Technique D can be implemented in the same way, viz., the inversion of bit K+1, points in regions X0 and X1 undergo the same operations in Technique E.
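The claim that one bit operation implements both translations can be checked exhaustively for a small word. The sketch below assumes a (K+3)-bit two's complement encoding as in paragraph [0078]; the helper `invert_word` and the way values are encoded and decoded are illustrative, and `invert` returns the new pair rather than updating it in place.

```python
def invert_word(bit, v):
    """Invert bit `bit` (= K+1) of v, then copy it into bit `bit`+1 (= K+2).

    v is a signed value held in a two's complement word of bit+2 bits.
    """
    width = bit + 2
    u = v & ((1 << width) - 1)                          # encode as an unsigned word
    u ^= 1 << bit                                       # invert bit K+1
    top = (u >> bit) & 1
    u = (u & ~(1 << (bit + 1))) | (top << (bit + 1))    # bit K+2 := bit K+1
    return u - (1 << width) if u >> (width - 1) else u  # decode back to signed

def invert(bit, r0, r1):
    """Apply the same bit operation to the sum word and the carry word."""
    return invert_word(bit, r0), invert_word(bit, r1)

K = 3
t = 2 ** (K + 1)
print(invert(K + 1, -20, 20))   # translation of (-20, 20) over (t, -t)
```

For every value that a translation over +t or −t maps into the inner square, `invert_word` produces the translated value, so the selection logic never has to know which direction was needed.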

[0084] Because bit K+2 and bit K+1 are always the same, we can just as well omit bit K+2. Thus, bit K+1 becomes the most-significant bit. If we omit bit K+2, we can illustrate the technique by means of the inner square only. FIG. 3A illustrates Technique E, where ADD* means addition of D followed by inversion of bit K+1, SUB* means subtraction of D followed by inversion of bit K+1, 2X* means doubling followed by inversion of bit K+1, and 2X means doubling without inversion of bit K+1.

[0085] There is an alternative to Technique E, called Technique F, which is illustrated in FIG. 3B. Here the region 2X is larger than in Technique E but the regions 2X* are smaller. Although the operations in Techniques E and F are the same, the tests for membership in any of the regions are different. The efficiencies of the implementations of these tests may well decide which technique is fastest.

[0086] Adding or Subtracting 2*D

[0087] In order to modify our carry-save division to allow for the addition or subtraction of 2*D as well as D, we distinguish the four squares S0, S1, S1′, and X2 in the north-east and the four squares A0, A1, A1′ and X3 in the south-west corners as illustrated in FIGS. 4A and 4B.

[0088] Subtracting D from any point in region S1 or S1′ yields a point in region TS1, as illustrated in FIG. 4A. Subtracting 2*D from any point in region S0, however, yields a point in region TS0. Notice that this region is outside the bold inner square. Here is the calculation for the subtraction of 2*D. First, recall that in a two's complement representation with K+3 bits D=001x. Thus, 2*D=01x, and −2*D is represented by the bit-wise complement of 2*D plus 1 at the least-significant bit position, i.e., −2*D=10y+1, where y is the bit-wise complement of x.

$$
\begin{array}{lr}
r_0 & 001 \\
r_1 & 001 \\
-2D & 10y+1 \\
\hline
\text{parity} & 10? \\
\text{majority} & 01?
\end{array}
$$

[0089] As a consequence the result of subtracting 2*D from any point in region S0 is a point (r₀, r₁), where the two most-significant bits of r₀ are 10 and the two most-significant bits of r₁ are 01. This point lies in region TS0 of FIG. 4A.
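The bit pattern in the calculation above can be reproduced numerically. The sketch assumes, reading from the worked calculation, that region S0 is the set of points whose sum and carry words both start with the bits 001 in a (K+3)-bit word, and that the "+1" of the two's complement negation is injected at the vacant least-significant bit of the carry word. The function name `subtract_2d_carry_save` is illustrative.

```python
def subtract_2d_carry_save(r0, r1, D, K):
    """Carry-save subtraction of 2*D on (K+3)-bit words; returns (parity, carry)."""
    width = K + 3
    mask = (1 << width) - 1
    z = ~(2 * D) & mask                        # bit-wise complement of 2*D ("10y")
    parity = r0 ^ r1 ^ z                       # sum word
    majority = (r0 & r1) | (r0 & z) | (r1 & z)
    carry = ((majority << 1) | 1) & mask       # shifted carries, "+1" at the free LSB
    return parity, carry

K = 3
parity, carry = subtract_2d_carry_save(r0=9, r1=14, D=11, K=K)
print(format(parity, "06b"), format(carry, "06b"))
```

For every r0, r1 and D whose top three bits are 001, the two leading bits of the parity word come out as 10 and those of the carry word as 01, matching the table, while parity plus carry equals r0+r1−2*D modulo 2^(K+3).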

[0090] After a translation over (2^(K+1), −2^(K+1)), regions TS1 and TS0 end up inside the inner square, as illustrated in FIG. 4B. Accordingly, if each subtraction of D from points in regions S1 and S1′ and each subtraction of 2*D from points in region S0 is followed by a translation over (2^(K+1), −2^(K+1)), the result remains within the bold inner square.

[0091] There is another important observation that can be made from FIGS. 4A and 4B. After subtracting D or 2*D and a translation, any point in region S0, S1, or S1′, ends up in region TS0 or TS1 of FIG. 4B. Because regions TS0 and TS1 are within region X2* of FIG. 3A, in the next repetition step, each of these regions may undergo a doubling and another translation. In summary, each subtraction of D from points in regions S1 or S1′ and each subtraction of 2*D from points in region S0 will be followed by a translation, a doubling, and another translation, in that order.

[0092] In an implementation using only K+1 non-fractional bits, each translation is an inversion of the most significant bit and each doubling is a binary shift. In effect, a translation followed by a doubling and then another translation is the same as a doubling followed by a translation, because each doubling throws away the most significant bit. So there is no need to do a translation after an addition and before a doubling, because the bit that gets changed in the translation will be thrown away anyway in the following doubling.
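The equivalence of "translate, double, translate" and "double, translate" is a property of fixed-width shifting and can be checked for every word of a given size. The sketch treats words as unsigned bit patterns of an arbitrary illustrative `width`; the helper names are hypothetical.

```python
def flip_msb(u, width):
    """Translation in the reduced representation: invert the most significant bit."""
    return u ^ (1 << (width - 1))

def double(u, width):
    """Doubling: left shift by one, discarding the bit shifted out of the word."""
    return (u << 1) & ((1 << width) - 1)

width = 5   # illustrative word size
same = all(
    flip_msb(double(flip_msb(u, width), width), width) == flip_msb(double(u, width), width)
    for u in range(1 << width)
)
print(same)
```

The first translation only touches the bit that the shift discards, so omitting it cannot change the result.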

[0093] For reasons of symmetry, the same reasoning applies to additions of D to points in region A1 or A1′ and addition of 2*D to points in region A0. In summary, every subtraction and addition can be followed by a doubling and a translation. As a result, we obtain the following division technique.

    Technique G:
    q0:=0; q1:=0; c:=C; r0:=R; r1:=0; n:=0;
    while (n <= K+L+2) do {
      if ((r0,r1) in 2X) then
        {r0,r1:=r0*2,r1*2; c:=c/2; n:=n+1}
      elseif ((r0,r1) in 2X*) then
        {r0,r1:=r0*2,r1*2; c:=c/2; n:=n+1;
         invert(K+1, r0, r1)}
      elseif ((r0,r1) in SUB1) then
        {r0,r1:=add(r0,r1,−D); q0,q1:=add(q0,q1,c);
         r0,r1:=r0*2,r1*2; c:=c/2; n:=n+1;
         invert(K+1, r0, r1)}
      elseif ((r0,r1) in SUB2) then
        {r0,r1:=add(r0,r1,−2*D); q0,q1:=add(q0,q1,2*c);
         r0,r1:=r0*2,r1*2; c:=c/2; n:=n+1;
         invert(K+1, r0, r1)}
      elseif ((r0,r1) in ADD1) then
        {r0,r1:=add(r0,r1,D); q0,q1:=add(q0,q1,−c);
         r0,r1:=r0*2,r1*2; c:=c/2; n:=n+1;
         invert(K+1, r0, r1)}
      elseif ((r0,r1) in ADD2) then
        {r0,r1:=add(r0,r1,2*D); q0,q1:=add(q0,q1,−2*c);
         r0,r1:=r0*2,r1*2; c:=c/2; n:=n+1;
         invert(K+1, r0, r1)}
    }

[0094] FIG. 5 illustrates the regions 2X, 2X*, SUB1, SUB2, ADD1, and ADD2. In region 2X, each point undergoes a doubling; in region 2X* each point undergoes a doubling followed by an inversion of the most-significant bit. In region SUB1, each point undergoes a subtraction of D followed by a doubling and finally an inversion of the most significant bit. In region SUB2, each point undergoes a subtraction of 2D followed by a doubling and finally an inversion of the most significant bit. In region ADD1, each point undergoes an addition of D followed by a doubling and finally an inversion of the most significant bit. Finally, in region ADD2, each point undergoes an addition of 2D followed by a doubling and finally an inversion of the most significant bit.

[0095] Because each addition and subtraction is followed by a doubling, this technique makes exactly K+L+3 repetition steps, which is the number of doublings necessary for each of the techniques to terminate. The tests for membership in each of the regions are simple and rely only on the two most significant bits of r₀ and r₁.

[0096] Technique H

[0097] Another technique, Technique H, considers seven alternatives in each repetition step. These alternatives correspond to the regions of FIG. 6. Here the actions for region 4X* are a quadrupling of the carry and sum of the remainder and a division by four of c, followed by an inversion of the most-significant bit of the carry and sum of the remainder. The reason for the quadrupling and inversions is as follows. Recall that the operation 2X* on a region in the northwest or in the southeast quadrant is the same as a scaling by 2X of the region with the upper left or lower right corner, respectively, as the center for the scaling. Accordingly, if a technique executes the operation 2X* twice for each of the small squares labeled 4X*, these regions map exactly to the large complete square. Finally, notice that executing the operation 2X* twice is the same as the operation 4X*, because the most-significant bits after the first 2X* operation are shifted out during the second 2X* operation. Thus, if the second operation is 2X*, it does not matter whether the first operation is a 2X operation or a 2X* operation. Incorporating the 4X* operation in Technique G gives Technique H, symbolized by FIG. 6. The complete technique appears at the end of this section.
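The statement that two 2X* operations collapse into one 4X* operation can be confirmed on bit patterns. The sketch below models a single word of the remainder as an unsigned pattern of an arbitrary illustrative `width`; the function names `op_2x`, `op_2x_star`, and `op_4x_star` are hypothetical labels for the operations named in the text.

```python
def op_2x(u, width):
    """2X: left shift by one within a fixed-width word."""
    return (u << 1) & ((1 << width) - 1)

def op_2x_star(u, width):
    """2X*: left shift by one, then invert the most significant bit."""
    return op_2x(u, width) ^ (1 << (width - 1))

def op_4x_star(u, width):
    """4X*: left shift by two, then invert the most significant bit."""
    return ((u << 2) & ((1 << width) - 1)) ^ (1 << (width - 1))

width = 6   # illustrative word size
ok = all(
    op_2x_star(op_2x_star(u, width), width)
    == op_4x_star(u, width)
    == op_2x_star(op_2x(u, width), width)
    for u in range(1 << width)
)
print(ok)
```

The chained comparison also covers the remark that the first of the two operations may be either 2X or 2X* without affecting the outcome.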

[0098] Having the regions 4X* in Technique H may reduce the total number of repetition steps. How large the reduction is depends on how often Technique H encounters a remainder in a 4X* square.

[0099] The price to pay for this potential reduction may be a small increase in the average duration of a repetition step. Because of the extra alternative, the selection logic, which determines which alternative the technique executes, becomes slightly more complex, and the extra alternative may slightly slow down some multiplexer in an implementation. The combination of these two factors may increase the duration of each repetition step slightly. Technique H will be an improvement over Technique G if the decrease in execution time due to the reduction in repetition steps is larger than the increase in execution time due to a larger average duration of the repetition step.

    Technique H:
    q0:=0; q1:=0; c:=C; r0:=R; r1:=0; n:=0;
    while (n <= K+L+2) do {
      if ((r0,r1) in 2X) then
        {r0,r1:=r0*2,r1*2; c:=c/2; n:=n+1}
      elseif ((r0,r1) in 2X*) then
        {r0,r1:=r0*2,r1*2; c:=c/2; n:=n+1;
         invert(K+1, r0, r1)}
      elseif ((r0,r1) in 4X*) then
        {r0,r1:=r0*4,r1*4; c:=c/4; n:=n+1;
         invert(K+1, r0, r1)}
      elseif ((r0,r1) in SUB1) then
        {r0,r1:=add(r0,r1,−D); q0,q1:=add(q0,q1,c);
         r0,r1:=r0*2,r1*2; c:=c/2; n:=n+1;
         invert(K+1, r0, r1)}
      elseif ((r0,r1) in SUB2) then
        {r0,r1:=add(r0,r1,−2*D); q0,q1:=add(q0,q1,2*c);
         r0,r1:=r0*2,r1*2; c:=c/2; n:=n+1;
         invert(K+1, r0, r1)}
      elseif ((r0,r1) in ADD1) then
        {r0,r1:=add(r0,r1,D); q0,q1:=add(q0,q1,−c);
         r0,r1:=r0*2,r1*2; c:=c/2; n:=n+1;
         invert(K+1, r0, r1)}
      elseif ((r0,r1) in ADD2) then
        {r0,r1:=add(r0,r1,2*D); q0,q1:=add(q0,q1,−2*c);
         r0,r1:=r0*2,r1*2; c:=c/2; n:=n+1;
         invert(K+1, r0, r1)}
    }

[0100] Implementations

[0101] FIGS. 7-9 present implementations of three of the above-described division techniques, illustrating the operations on the remainder. All of these figures provide a rough schematic showing the elementary modules in an implementation. These modules are a carry-save adder, indicated by “CSA,” a multiplexer, indicated by a trapezoid labeled MUX, the selection logic, indicated by SLC, and the implementations of the other actions of the techniques, indicated by 2X, 2X*, 4X*, or just *. An oval with a single star (*) represents the implementation that only inverts the most-significant bit of the sum and carry. An oval with the label 2X* implements a left-shift by one followed by an inversion of the most significant bit of sum and carry. An oval with the label 2X represents just a left-shift by one.

[0102] These figures do not show the accumulation of quotient digits or any other operations on the quotient. The figures also do not show implementations of any post-processing steps, like the implementation of any restoration step, rounding, or conversion that must occur for the quotient after termination of the technique. These may be implemented using any one of a number of standard techniques.

[0103] FIG. 7 shows an implementation of Technique E. This implementation includes two carry-save adders (one for adding D and one for adding −D) and a 4-to-1 multiplexer. FIG. 3A represents the regions that must be detected by the corresponding selection logic to provide the correct input to the multiplexer. Technique F can be implemented in a similar manner.

[0104] Although the implementation shows a 4-to-1 multiplexer, the actual implementation may be closer to a 3-to-1 multiplexer. Recall that the results of the operations 2X and 2X* are the same except for the most significant bit of sum and carry. Thus, the equivalent parts of the 2X and 2X* inputs of the multiplexer can be combined. This merging also reduces the capacitance on the select input of the multiplexer.

[0105] FIG. 8 illustrates an implementation of Technique H. This implementation includes four carry-save adders for adding −2D, −D, +D, or +2D, and a 7-to-1 multiplexer. FIG. 6 illustrates the regions that must be detected by the selection logic. The oval with label 4X* implements a left-shift by two followed by an inversion of the most significant bit of sum and carry. Similar to the previous implementation, the 7-to-1 multiplexer can be implemented with a component that is almost a 6-to-1 multiplexer.

[0106] FIG. 9 illustrates an implementation of Technique G. It uses two multiplexers, one 4-to-1 multiplexer for the input to a single carry-save adder, and a 3-to-1 multiplexer to produce the final output. As with the previous two implementations, this last multiplexer is almost a 2-to-1 multiplexer.

[0107] Technique G can also be implemented in the manner illustrated in FIG. 8, where there is only one large multiplexer. However, splitting the multiplexer in two parts, as illustrated in FIG. 9, may have some advantages. First, the implementation illustrated in FIG. 9 uses only one carry-save adder, whereas the implementation illustrated in FIG. 8 uses four carry-save adders, which consume a significant amount of area and energy. Second, the implementation of FIG. 9 avoids a large fan-in and a large fan-out for the final multiplexer, assuming that stages are cascaded. The large fan-in and fan-out with one multiplexer slows down the critical path for all of the alternatives. Splitting the multiplexer into two decreases the critical path delay for the alternatives that exclude the carry-save adder and increases the critical path delay for the alternatives that include the carry-save adder. Increasing the difference between path delays for the respective alternatives may be bad for a synchronous circuit implementation, but an asynchronous implementation may be able to take advantage of this difference by achieving an average-case delay that is less than the critical path delay of the implementation with the large multiplexer. This situation may apply if the alternatives that exclude carry-save addition occur more frequently than the alternatives that include carry-save addition.
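The condition under which an asynchronous implementation benefits can be made explicit with a short derivation. The symbols here are illustrative assumptions and do not appear in the original: p is the fraction of repetition steps that exclude carry-save addition, t_f and t_s are the step delays of the split design without and with the carry-save adder, and T is the critical path delay of the design with one large multiplexer, with t_f < T < t_s.

```latex
\begin{aligned}
t_{\text{avg}} &= p\,t_f + (1-p)\,t_s \\
t_{\text{avg}} < T
  &\iff t_s - p\,(t_s - t_f) < T \\
  &\iff p > \frac{t_s - T}{t_s - t_f}
\end{aligned}
```

The split design therefore wins on average only when the alternatives without carry-save addition are frequent enough to exceed this threshold.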

[0108] The selection logic for each of the implementations is simple. As an example, we present the equations for FIG. 5, where we assume that c₀ and c₁ are the most and second-most significant bits of the carry, respectively, and s₀ and s₁ are the most and second-most significant bits of the sum, respectively. Below, the notation ⊕ denotes XOR.

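$$\mathrm{2X^{*}} = s_0 \oplus c_0$$

$$\mathrm{2X} = s_0\,s_1\,c_0\,c_1 + \overline{s_0}\,\overline{s_1}\,\overline{c_0}\,\overline{c_1}$$

$$\mathrm{SUB1} = \overline{s_0}\,\overline{s_1}\,\overline{c_0}\,c_1 + \overline{s_0}\,s_1\,\overline{c_0}\,\overline{c_1}$$

$$\mathrm{SUB2} = \overline{s_0}\,s_1\,\overline{c_0}\,c_1$$

$$\mathrm{ADD1} = s_0\,s_1\,c_0\,\overline{c_1} + s_0\,\overline{s_1}\,c_0\,c_1$$

$$\mathrm{ADD2} = s_0\,\overline{s_1}\,c_0\,\overline{c_1}$$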
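The selection equations can be exercised directly in software. The sketch below takes the 2X* term to be true when s₀ and c₀ differ and places each complement on a single bit. Both readings are assumptions, chosen so that the six conditions are mutually exclusive and cover all 16 combinations of the four bits. The function names are hypothetical.

```python
from itertools import product


def inv(b):
    """Complement of a single bit (0 or 1)."""
    return 1 - b


def select(s0, s1, c0, c1):
    """Return the names of all alternatives whose condition holds for the given bits."""
    conditions = {
        "2X*":  s0 ^ c0,
        "2X":   (s0 & s1 & c0 & c1) | (inv(s0) & inv(s1) & inv(c0) & inv(c1)),
        "SUB1": (inv(s0) & inv(s1) & inv(c0) & c1) | (inv(s0) & s1 & inv(c0) & inv(c1)),
        "SUB2": inv(s0) & s1 & inv(c0) & c1,
        "ADD1": (s0 & s1 & c0 & inv(c1)) | (s0 & inv(s1) & c0 & c1),
        "ADD2": s0 & inv(s1) & c0 & inv(c1),
    }
    return [name for name, hit in conditions.items() if hit]


def is_partition():
    """True if exactly one alternative is selected for each of the 16 bit patterns."""
    return all(len(select(*bits)) == 1 for bits in product((0, 1), repeat=4))
```

Under these assumptions is_partition() holds, which is what a multiplexer select signal requires: exactly one input is chosen for every combination of the leading sum and carry bits.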

[0109] Concluding Remarks

[0110] All of the above-described techniques are easy to implement by a synchronous or asynchronous circuit. Techniques E and F take more repetition steps to terminate than Technique G. How many more repetition steps these techniques need depends on the number of additions and subtractions that the technique executes. We expect the number of additions and subtractions as a fraction of the number of doublings will be around 0.5, based on some quick calculations and assuming uniform distributions. This means that we expect that for every two doublings there will be one addition or subtraction. Simulations will show what the exact fraction is. Because Techniques E and F execute each addition and subtraction in a repetition step separate from a doubling, Techniques E and F execute 50% more repetition steps than Technique G, if the number of additions and subtractions per doubling is 0.5. Although Technique G executes fewer repetition steps, this technique needs to consider six alternatives in each repetition step, whereas Techniques E and F need to consider only four alternatives. The number of alternatives to be considered in each repetition step may have some effect on the execution time of the repetition step.
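The 50% figure follows from a simple count. As a worked illustration, with symbols that are assumptions and not part of the original, let d be the number of doublings and a the number of additions and subtractions in one division, with a = 0.5 d as estimated above.

```latex
\begin{aligned}
\text{steps}_{E,F} &= d + a = d + 0.5\,d = 1.5\,d \\
\text{steps}_{G}   &= d \\
\frac{\text{steps}_{E,F}}{\text{steps}_{G}} &= 1.5
\end{aligned}
```

Technique G needs only d steps because it folds each addition or subtraction into the same repetition step as a doubling.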

[0111] There are two ways in which the above-described techniques can be generalized. Both generalizations consider the three most significant bits of sum and carry, which means there will be 64 small squares instead of 16. In one generalization, the divisor D is of the form D=01 . . . and in the other generalization D is of the form D=001 . . . In both cases, the action for each of the squares is some combination of the actions 2X*, 4X*, 8X*, 2X, 4X, SUB1, SUB2, SUB3, SUB4, ADD1, ADD2, ADD3, and ADD4. We have not pursued any of these generalizations, nor do we know whether the extra delay in a repetition step due to the extra complexity in selection logic, larger multiplexers, and larger drivers will be compensated by a further reduction in repetition steps.

[0112] We also have not discussed any other optimizations, such as overlapping quotient-selection of successive stages, overlapping remainder formation of successive stages, or any hybrid of these optimizations. These techniques can be applied to all implementations.

[0113] The foregoing descriptions of embodiments of the present invention have been presented for purposes of illustration and description only. They are not intended to be exhaustive or to limit the present invention to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. Additionally, the above disclosure is not intended to limit the present invention. The scope of the present invention is defined by the appended claims.

What is claimed is:
 1. A method for performing a carry-save division operation, wherein the carry-save division operation divides a numerator, N, by a denominator, D, to produce an approximation of a quotient, Q=N/D, the method comprising: approximating Q by iteratively selecting an operation to perform based on higher order bits of a remainder, r, and then performing the operation; wherein the operation can include, subtracting D from r and adding a coefficient, c, to a quotient calculated thus far, q, or adding D to r and subtracting c from q; wherein the subtraction and addition operations maintain r and q in carry-save form, which eliminates the need for carry propagation and thereby speeds up the division operation.
 2. The method of claim 1, wherein maintaining r in carry-save form involves maintaining a sum component, r_(s), and a carry component, r_(c).
 3. The method of claim 2, wherein maintaining q in carry-save form involves maintaining a sum component, q_(s), and a carry component, q_(c).
 4. The method of claim 3, further comprising initializing r, q and c; wherein initializing r involves setting r_(s)=R and r_(c)=0; wherein initializing q involves setting q_(s)=0 and q_(c)=0; and wherein initializing c involves setting c=1.
 5. The method of claim 3, wherein after the iterations are complete, the method further comprises performing a carry completion addition that adds q_(s) and q_(c) to produce Q in non-redundant form.
 6. The method of claim 2, wherein the operation can additionally include multiplying both r_(s) and r_(c) by 2 and dividing c by 2.
 7. The method of claim 2, wherein the operation can additionally include multiplying both r_(s) and r_(c) by 2, dividing c by 2 and then inverting the most significant bits of r_(s) and r_(c).
 8. The method of claim 2, wherein the operation can additionally include multiplying both r_(s) and r_(c) by 4, dividing c by 4 and then inverting the most significant bits of r_(s) and r_(c).
 9. The method of claim 2, wherein the operation can additionally include subtracting D from r_(s) and r_(c), adding c to q_(s) and q_(c), multiplying both r_(s) and r_(c) by 2, dividing c by 2 and then inverting the most significant bits of r_(s) and r_(c).
 10. The method of claim 2, wherein the operation can additionally include subtracting 2D from r_(s) and r_(c), adding 2c to q_(s) and q_(c), multiplying both r_(s) and r_(c) by 2, dividing c by 2 and then inverting the most significant bits of r_(s) and r_(c).
 11. The method of claim 2, wherein the operation can additionally include adding D to r_(s) and r_(c), subtracting c from q_(s) and q_(c), multiplying both r_(s) and r_(c) by 2, dividing c by 2 and then inverting the most significant bits of r_(s) and r_(c).
 12. The method of claim 2, wherein the operation can additionally include adding 2D to r_(s) and r_(c), subtracting 2c from q_(s) and q_(c), multiplying both r_(s) and r_(c) by 2, dividing c by 2 and then inverting the most significant bits of r_(s) and r_(c).
 13. An apparatus that performs a carry-save division operation through an iterative process, wherein the carry-save division operation divides a numerator, N, by a denominator, D, to produce an approximation of a quotient, Q=N/D, the apparatus comprising: a selection mechanism configured to select an operation to perform based on higher order bits of a remainder, r; an execution mechanism configured to perform the selected operation; wherein the operation can include, subtracting D from r and adding a coefficient, c, to a quotient calculated thus far, q, or adding D to r and subtracting c from q; wherein the execution mechanism is configured to maintain r and q in carry-save form, which eliminates the need for carry propagation and thereby speeds up the division operation.
 14. The apparatus of claim 13, wherein the execution mechanism is configured to maintain r in carry-save form by maintaining a sum component, r_(s), and a carry component, r_(c).
 15. The apparatus of claim 13, wherein the execution mechanism is configured to maintain q in carry-save form by maintaining a sum component, q_(s), and a carry component, q_(c).
 16. The apparatus of claim 13, further comprising an initialization mechanism configured to initialize r, q and c; wherein initializing r involves setting r_(s)=R and r_(c)=0; wherein initializing q involves setting q_(s)=0 and q_(c)=0; and wherein initializing c involves setting c=1.
 17. The apparatus of claim 15, wherein after the iterations are complete, the execution mechanism is additionally configured to perform a carry completion addition that adds q_(s) and q_(c) to produce Q in non-redundant form.
 18. The apparatus of claim 14, wherein the operation can additionally include multiplying both r_(s) and r_(c) by 2 and dividing c by 2.
 19. The apparatus of claim 14, wherein the operation can additionally include multiplying both r_(s) and r_(c) by 2, dividing c by 2 and then inverting the most significant bits of r_(s) and r_(c).
 20. The apparatus of claim 14, wherein the operation can additionally include multiplying both r_(s) and r_(c) by 4, dividing c by 4 and then inverting the most significant bits of r_(s) and r_(c).
 21. The apparatus of claim 14, wherein the operation can additionally include subtracting D from r_(s) and r_(c), adding c to q_(s) and q_(c), multiplying both r_(s) and r_(c) by 2, dividing c by 2 and then inverting the most significant bits of r_(s) and r_(c).
 22. The apparatus of claim 14, wherein the operation can additionally include subtracting 2D from r_(s) and r_(c), adding 2c to q_(s) and q_(c), multiplying both r_(s) and r_(c) by 2, dividing c by 2 and then inverting the most significant bits of r_(s) and r_(c).
 23. The apparatus of claim 14, wherein the operation can additionally include adding D to r_(s) and r_(c), subtracting c from q_(s) and q_(c), multiplying both r_(s) and r_(c) by 2, dividing c by 2 and then inverting the most significant bits of r_(s) and r_(c).
 24. The apparatus of claim 14, wherein the operation can additionally include adding 2D to r_(s) and r_(c), subtracting 2c from q_(s) and q_(c), multiplying both r_(s) and r_(c) by 2, dividing c by 2 and then inverting the most significant bits of r_(s) and r_(c).
 25. The apparatus of claim 13, wherein the selection mechanism includes a multiplexer that selects between outputs of a number of functional units that perform alternative operations in parallel.
 26. The apparatus of claim 13, wherein the execution mechanism includes one or more carry-save adders.
 27. A computer system that performs a carry-save division operation through an iterative process, wherein the carry-save division operation divides a numerator, N, by a denominator, D, to produce an approximation of a quotient, Q=N/D, the computer system comprising: a processor; a memory; a division unit within the processor; a selection mechanism within the division unit configured to select an operation to perform based on higher order bits of a remainder, r; an execution mechanism within the division unit configured to perform the selected operation; wherein the operation can include, subtracting D from r and adding a coefficient, c, to a quotient calculated thus far, q, or adding D to r and subtracting c from q; wherein the execution mechanism is configured to maintain r and q in carry-save form, which eliminates the need for carry propagation and thereby speeds up the division operation.